import pandas as pd
import numpy as np
# rng = np.random.default_rng(2)
# import holoviews as hv
# from holoviews import opts
# hv.extension('bokeh')
!dvc pull
Everything is up to date.
Investigate the BERTopic documentation (linked), and train a model using their library to create a topic model of the flavor_text data in the dataset above.
In topic_model.py, load the data and train a BERTopic model. You will save the model in that script as a new trained model object.
Add a stage to dvc.yaml that has mtg.feather and topic_model.py as dependencies, and your trained model as an output.
Create a topic_visualization interactive plot (see docs), using the release_date column as timestamps.
# Read in magic data
df = (
pd.read_feather('../../../data/mtg.feather')
.dropna(subset=['flavor_text', 'text'])
)
# Load trained BERTopic model
from bertopic import BERTopic
# topic_model = BERTopic.load("bertopic_model")
# topic_model = BERTopic.load("my_model_custom_embeddings")
# topic_model = BERTopic.load("my_model_no_min_topic_size")
topic_model = BERTopic.load("my_model_embeddings")
I did a number of different iterations when training a BERTopic model. The last one is the one I decided to use for this submission.
I did the following:
Preprocessed the flavor text to decontract words ("won't" changed to "will not", for example)
Created a custom embedding model using SentenceTransformer that I trained on the flavor_text corpus itself (see "Custom Embeddings")
Included a CountVectorizer inside BERTopic to remove English stopwords, with ngram_range = (1, 1) (see "I am facing memory issues. Help!")
Set min_cluster_size in HDBSCAN equal to 100 (see "How do I reduce topic outliers")
Set nr_topics to 'auto' so BERTopic can merge topics that are similar to one another
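The decontraction step can be sketched as follows. This is a minimal, hypothetical stand-in; the real preprocess() in my_functions.py presumably covers more patterns than the handful shown here.

```python
import re

# Minimal, hypothetical version of the decontraction step; the actual
# preprocess() in my_functions.py presumably handles more cases.
# Order matters: irregular forms ("won't") must be replaced before the
# generic "n't" rule.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
]

def decontract(text: str) -> str:
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return text

decontract("They won't survive the night")  # → "They will not survive the night"
```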
from my_functions import preprocess
# Preprocess the flavor_text
docs = preprocess(df.flavor_text)
# Fit transform
topics, probs = topic_model.fit_transform(docs)
topic_model.visualize_topics()
100%|█████████████████████████████████| 29635/29635 [00:00<00:00, 100179.16it/s]
Batches: 0%| | 0/927 [00:00<?, ?it/s]
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
len(topic_model.get_topics())
46
Because of the highly stochastic nature of UMAP, BERTopic's results vary between runs (see "Why are the results not consistent between runs?"), so it took me a while to figure out the general layout of the topics.
Thanks to the intertopic distance chart, I could see 6 somewhat-distinct clusters in the topics. Therefore, I decided to further reduce the number of topics before naming them.
new_topics, new_probs = topic_model.reduce_topics(docs, topics, nr_topics=6)
topic_model.visualize_topics()
set(new_topics)
{-1, 0, 1, 2, 3, 4, 5}
topic_model.get_topic_info()
| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 18869 | -1_life_death_like_world |
| 1 | 0 | 5893 | 0_power_dead_nature_strength |
| 2 | 1 | 1752 | 1_fight_sword_blade_battle |
| 3 | 2 | 1065 | 2_sun_light_darkness_night |
| 4 | 3 | 782 | 3_prey_hunt_hunter_werewolves |
| 5 | 4 | 724 | 4_mage_magic_wizard_mages |
| 6 | 5 | 550 | 5_hear_silent_roar_sound |
From the reduced topics above, these are the names I've come up with:
Topic -1: Outliers
Topic 0: Earth/Nature
Topic 1: Death
Topic 2: Elves/Forest Creatures
Topic 3: Wolves/Hunters
Topic 4: Godly Gifts
Topic 5: War
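For later use (e.g. labeling documents or plot legends), the names above can be attached to each document's reduced topic id with a plain mapping; the dict below is just the list above in code form.

```python
# Map reduced topic ids to the names chosen above
topic_names = {
    -1: "Outliers",
    0: "Earth/Nature",
    1: "Death",
    2: "Elves/Forest Creatures",
    3: "Wolves/Hunters",
    4: "Godly Gifts",
    5: "War",
}

# e.g. label each document (new_topics comes from reduce_topics above):
# named = [topic_names[t] for t in new_topics]
```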
# Convert release_date to a list
timestamps = df.release_date.tolist()
# Using new_topics because it's been reduced to 6 topics
topics_over_time = topic_model.topics_over_time(docs, new_topics, timestamps)
281it [00:05, 49.72it/s]
topic_model.visualize_topics_over_time(topics_over_time, topics = [1,2,3,4,5])
I spent a lot of time on part 1 because I wanted to play with the parameters and really understand BERTopic. Thanks to the custom embedding trained on the corpus and a second layer of topic reduction based on the clusters I observed, I think the final product (the Dynamic Topic Models chart) doesn't seem too outrageous.
Using only the text and flavor_text data, predict the color identity of cards:
Follow the sklearn documentation covered in class on text data and Pipelines to create a classifier that predicts which of the colors a card is identified as.
You will need to preprocess the target _color_identity_ labels depending on the task:
In multiclass.py, again load data and train a Pipeline that preprocesses the data and trains a multiclass classifier (LinearSVC), and saves the model pickle output once trained. Target labels with more than one color should be unlabeled!
In multilabel.py, do the same, but with a multilabel model (e.g. here). You should now use the original color_identity data as-is, with special attention to the multi-color cards.
In dvc.yaml, add these as stages to take the data and scripts as input, with the trained/saved models as output.
in your notebook:
The preprocessing steps I did in both multiclass.py and multilabel.py are the same as the steps I did for flavor_text in topic_model.py. The preprocess function for the text column(s) came from my_functions.py to ensure that I used the same preprocessing steps in all parts of this notebook. In addition:
After applying the preprocess() function to the text and flavor_text columns, I concatenated both columns (shown below) - this is my X feature.
In the multiclass case, because cards with more than one color identity must be treated as unlabeled, I dropped them (along with cards with no color identity, i.e. [ ]) so I could fit a multiclass classifier on the color_identity column.
In the multilabel case, I applied a MultiLabelBinarizer on the color_identity column. This way, I did not need to drop cards that had no color_identity - they simply become an array of [0, 0, 0, 0, 0] (since we have 5 colors), whereas a card with 4 different colors might look something like [1, 0, 1, 1, 1].
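The classifier itself is defined in the scripts rather than this notebook, but based on the CountVectorizer, TfidfTransformer, and LinearSVC parameters tuned in the DVC experiments later on, the multiclass pipeline presumably looks roughly like this. It is a sketch under those assumptions, not the exact contents of multiclass.py, and the parameter values shown are illustrative defaults.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Sketch of the presumed text-classification pipeline; parameter values
# are illustrative, not necessarily those used in multiclass.py.
multiclass = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),   # token counts
    ("tfidf", TfidfTransformer(use_idf=True)),       # reweight by idf
    ("clf", LinearSVC(loss="squared_hinge")),        # linear SVM classifier
])
```

For the multilabel variant, the same pipeline can be wrapped in sklearn's OneVsRestClassifier so that a separate LinearSVC is fit per color.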
# Read in magic data
df = (
pd.read_feather('../../../data/mtg.feather')
.dropna(subset=['flavor_text', 'text'])
.reset_index(drop=True)
)
# Keep rows where the len of color_identity is 1
df = df[df['color_identity'].map(lambda d: len(d)) == 1].reset_index(drop=True)
# Because all lists in df.color_identity now has length 1, take the first item from each list
y = df.color_identity.apply(lambda x: x[0])
y.shape
(22418,)
from my_functions import preprocess
# Preprocess text columns
clean_flavor_text = preprocess(df.flavor_text)
clean_text = preprocess(df.text)
X = []
# Concatenate the 2 text columns together
for i in range(len(clean_text)):
text_concat = clean_text[i] + ". " + clean_flavor_text[i]
X.append(text_concat)
len(X)
100%|█████████████████████████████████| 22418/22418 [00:00<00:00, 101603.13it/s] 100%|██████████████████████████████████| 22418/22418 [00:00<00:00, 93638.18it/s]
22418
# load trained model
import pickle
with open('multiclass.pkl', 'rb') as f:
multiclass = pickle.load(f)
from sklearn.model_selection import train_test_split
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
#fit model with training data
multiclass.fit(X_train, y_train)
#evaluation on test data
y_pred = multiclass.predict(X_test)
from sklearn.metrics import confusion_matrix
# Note: sklearn's signature is confusion_matrix(y_true, y_pred); the
# arguments below are swapped, so rows are predictions, columns are true labels
confusion_matrix(y_pred, y_test)
array([[ 991, 29, 29, 27, 34],
[ 24, 1046, 36, 13, 31],
[ 29, 24, 1000, 13, 34],
[ 18, 39, 17, 1009, 30],
[ 39, 23, 30, 32, 1008]])
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
precision recall f1-score support
B 0.89 0.90 0.90 1101
G 0.91 0.90 0.91 1161
R 0.91 0.90 0.90 1112
U 0.91 0.92 0.91 1094
W 0.89 0.89 0.89 1137
accuracy 0.90 5605
macro avg 0.90 0.90 0.90 5605
weighted avg 0.90 0.90 0.90 5605
# Read in magic data
df = (
pd.read_feather('../../../data/mtg.feather')
.dropna(subset=['flavor_text', 'text'])
.reset_index(drop=True)
)
from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
y = MLB.fit_transform(df.color_identity)
MLB.classes_
array(['B', 'G', 'R', 'U', 'W'], dtype=object)
from my_functions import preprocess
clean_flavor_text = preprocess(df.flavor_text)
clean_text = preprocess(df.text)
X = []
for i in range(len(clean_text)):
text_concat = clean_text[i] + ". " + clean_flavor_text[i]
X.append(text_concat)
len(X)
100%|█████████████████████████████████| 29635/29635 [00:00<00:00, 101270.67it/s] 100%|██████████████████████████████████| 29635/29635 [00:00<00:00, 90429.14it/s]
29635
# load trained model
import pickle
with open('multilabel.pkl', 'rb') as f:
multilabel = pickle.load(f)
from sklearn.model_selection import train_test_split
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
#fit model with training data
multilabel.fit(X_train, y_train)
#evaluation on test data
y_pred = multilabel.predict(X_test)
from sklearn.metrics import multilabel_confusion_matrix
print(multilabel_confusion_matrix(y_test,y_pred))
[[[5723  151]
  [ 197 1338]]

 [[5615  159]
  [ 201 1434]]

 [[5668  140]
  [ 188 1413]]

 [[5715  178]
  [ 167 1349]]

 [[5544  204]
  [ 179 1482]]]
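Each 2×2 block returned by multilabel_confusion_matrix is laid out as [[TN, FP], [FN, TP]] for one label. As a sanity check, per-label precision and recall for the first label (B) can be recovered from the first block:

```python
import numpy as np

# First block above corresponds to label B: [[TN, FP], [FN, TP]]
block = np.array([[5723, 151],
                  [197, 1338]])
(tn, fp), (fn, tp) = block

precision = tp / (tp + fp)  # 1338 / 1489 ≈ 0.90
recall = tp / (tp + fn)     # 1338 / 1535 ≈ 0.87
```

These match the B row of the classification report, including its support of 1535 (= TP + FN).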
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred,target_names=MLB.classes_,zero_division=1))
precision recall f1-score support
B 0.90 0.87 0.88 1535
G 0.90 0.88 0.89 1635
R 0.91 0.88 0.90 1601
U 0.88 0.89 0.89 1516
W 0.88 0.89 0.89 1661
micro avg 0.89 0.88 0.89 7948
macro avg 0.89 0.88 0.89 7948
weighted avg 0.89 0.88 0.89 7948
samples avg 0.92 0.90 0.87 7948
I am actually blown away by how well both models did on the test set. Both managed to achieve high precision, recall, and f1-scores. I feel like with results this good, my spidey senses should be firing off. I really want to know what I did wrong in the preprocessing pipelines that returned these results.
Can we predict the EDHREC "rank" of the card using the data we have available?
predicted vs. actual rank, with a 45-deg line showing what "perfect prediction" should look like. How did you do? What would you like to try if you had more time?
I tried a lot of things, but the ones that I ended up going with are:
Things that didn't work:
If I had more time, I would've liked to:
# load trained model
import pickle
with open('regression.pkl', 'rb') as f:
reg = pickle.load(f)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from my_functions import multi
df = (
pd.read_feather('../../../data/mtg.feather')
.dropna(subset = ['edhrec_rank'])
.reset_index(drop=True)
)
# Source: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
numeric_features = ["converted_mana_cost"]
numeric_transformer = Pipeline(
steps=[("scaler", MinMaxScaler())]
)
cat_features = ["block","rarity"]
cat_transformer = OneHotEncoder(handle_unknown="ignore")
multi_label = multi(df,["types","subtypes", "color_identity","supertypes"])
X = pd.concat([df[['converted_mana_cost','rarity',"block"]],multi_label], axis = 1)
y = df['edhrec_rank']
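The multi helper from my_functions.py isn't shown in this notebook. Based on how it is used, a plausible re-implementation (an assumption; the real helper may differ) binarizes each list-valued column with a MultiLabelBinarizer and concatenates the indicator columns:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for my_functions.multi: one-hot encode each
# list-valued column and concatenate the resulting indicator frames.
def multi(df, cols):
    frames = []
    for col in cols:
        mlb = MultiLabelBinarizer()
        binarized = mlb.fit_transform(df[col])
        frames.append(pd.DataFrame(binarized, columns=mlb.classes_, index=df.index))
    return pd.concat(frames, axis=1)
```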
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
#fit model with training data
reg.fit(X_train, y_train)
#evaluation on test data
y_pred = reg.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
print(mean_squared_error(y_test, y_pred),mean_absolute_error(y_test, y_pred),r2_score(y_test,y_pred))
22205662.935580887 3333.3070807808263 0.46898020211749236
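To put the MSE above on the same scale as the rank itself, take the square root; the RMSE says the model's typical error is on the order of thousands of rank positions.

```python
import math

mse = 22205662.935580887  # mean_squared_error from above
rmse = math.sqrt(mse)     # ≈ 4712.3 rank positions of typical error
```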
import matplotlib.pyplot as plt
#Source: https://stackoverflow.com/questions/58410187/how-to-plot-predicted-values-vs-the-true-value
plt.figure(figsize=(10,10))
plt.scatter(y_test, y_pred, c='crimson')
# plt.yscale('log')
# plt.xscale('log')
p1 = max(max(y_pred), max(y_test))
p2 = min(min(y_pred), min(y_test))
plt.plot([p1, p2], [p1, p2], 'b-')
plt.xlabel('True Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.axis('equal')
plt.show()
from sklearn.inspection import permutation_importance
vi = permutation_importance(reg,X_train,y_train,n_repeats=5)
# Organize as a data frame
vi_dat = pd.DataFrame(dict(variable=X_train.columns,
                           vi=vi['importances_mean'],
                           std=vi['importances_std']))
# Generate intervals
vi_dat['low'] = vi_dat['vi'] - 2*vi_dat['std']
vi_dat['high'] = vi_dat['vi'] + 2*vi_dat['std']
# But in order from most to least important
vi_dat = vi_dat.sort_values(by="vi",ascending=False).reset_index(drop=True)
vi_dat
| | variable | vi | std | low | high |
|---|---|---|---|---|---|
| 0 | rarity | 6.482141e-01 | 0.003268 | 6.416790e-01 | 6.547493e-01 |
| 1 | block | 4.965136e-01 | 0.004339 | 4.878349e-01 | 5.051923e-01 |
| 2 | converted_mana_cost | 4.661245e-01 | 0.001588 | 4.629486e-01 | 4.693005e-01 |
| 3 | Creature | 2.546113e-01 | 0.001684 | 2.512425e-01 | 2.579802e-01 |
| 4 | R | 9.099565e-02 | 0.000885 | 8.922537e-02 | 9.276594e-02 |
| ... | ... | ... | ... | ... | ... |
| 349 | Coward | 4.645543e-08 | 0.000000 | 4.645543e-08 | 4.645543e-08 |
| 350 | Koth | 2.878551e-08 | 0.000000 | 2.878551e-08 | 2.878551e-08 |
| 351 | Tyvar | 2.308796e-08 | 0.000000 | 2.308796e-08 | 2.308796e-08 |
| 352 | Fractal | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 |
| 353 | Nautilus | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 |
354 rows × 5 columns
top10 = vi_dat[0:10]
from plotnine import *
# Plot
(
ggplot(top10,
aes(x="variable",y="vi")) +
geom_col(alpha=.5) +
geom_point() +
geom_errorbar(aes(ymin="low",ymax="high"),width=.2) +
scale_x_discrete(limits=top10.variable.tolist()) +
coord_flip() +
    labs(y="Reduction in R² (permutation importance)",x="")
)
<ggplot: (827719078)>
I picked my multilabel model, which had already performed pretty well early on, according to its F1-score, precision, and recall. In this experiment, I picked label_ranking_loss as the metric to track.
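sklearn's label_ranking_loss averages, per sample, the fraction of (relevant, irrelevant) label pairs that the predicted scores order incorrectly; 0 is a perfect ranking. A small worked example:

```python
import numpy as np
from sklearn.metrics import label_ranking_loss

y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1.0],
                    [1.0, 0.2, 0.1]])

# Sample 1 mis-orders 1 of its 2 (relevant, irrelevant) pairs -> 0.5;
# sample 2 mis-orders both of its pairs -> 1.0; the average is 0.75.
label_ranking_loss(y_true, y_score)  # → 0.75
```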
I wanted to tune the loss and penalty of LinearSVC. The default loss is squared_hinge. In my experiment, I changed it to hinge and recorded the change in metrics.json. Initially I wanted to change the L2 penalty to L1 as well, but the L1 penalty does not work with hinge loss.
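The parameters touched by these experiments suggest a params.yaml along these lines. This is a sketch inferred from the dvc exp diff output below: the key names match the diffs, and the values shown are the baseline (HEAD) settings.

```yaml
LinearSVC:
  loss: squared_hinge
TfidfTransformer:
  use_idf: true
CountVectorizer:
  ngram_range:
    min_n: 1
    max_n: 2
```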
!dvc exp diff
Path          Metric              HEAD     workspace  Change
metrics.json  label_ranking_loss  0.11119  0.11358    0.0023845

Path         Param           HEAD           workspace  Change
params.yaml  LinearSVC.loss  squared_hinge  hinge      diff not supported
Keeping the default L2 penalty and changing the loss function to hinge adds 0.0023845 to the label ranking loss. The closer label ranking loss is to 0, the better, so this is not optimal.
!dvc exp diff
Path          Metric              HEAD     workspace  Change
metrics.json  label_ranking_loss  0.11119  0.11244    0.0012485
metrics.json  use_idf             True     False      -1

Path         Param                     HEAD  workspace  Change
params.yaml  TfidfTransformer.use_idf  True  False      -1
Changing use_idf to False in the TfidfTransformer also increased label_ranking_loss (by 0.0012485).
!dvc exp diff
Path          Metric              HEAD     workspace  Change
metrics.json  label_ranking_loss  0.11119  0.13832    0.027129

Path         Param                              HEAD  workspace  Change
params.yaml  CountVectorizer.ngram_range.min_n  1     2          1
And changing the ngram_range in CountVectorizer to (2, 2) instead of (1, 2) also made the multilabel classifier worse.
!dvc exp diff
Path          Metric              HEAD     workspace  Change
metrics.json  label_ranking_loss  0.11119  0.10573    -0.0054663

Path         Param                              HEAD  workspace  Change
params.yaml  CountVectorizer.ngram_range.max_n  2     3          1
However, the one change that did improve the classifier was widening the ngram_range in CountVectorizer to (1, 3) instead of (1, 2), which lowered label_ranking_loss by 0.0054663.